{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Model Calibration\n",
    "\n",
    "## (a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that for logistic regression the $j$-th component of the gradient of the log-likelihood at the maximum likelihood $\\theta$ is given by "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ 0 = \\nabla_\\theta \\ell(\\theta) = \\sum_{i=1}^m x_j^{(i)}(y^{(i)}- h_\\theta(x^{(i)}).$$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In particular for $j=0$ we have $x_j^{(i)}=1$ for all $i$, so that we can conclude"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\sum_{i=1}^m P(y^{(i)}=1\\mid x^{(i)};\\theta)=\\sum_{i=1}^m h_\\theta(x^{(i)})= \\sum_{i=1}^m y^{(i)} = \\sum_{i=1}^m \\mathbb 1\\{y^{(i)}=1\\}.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The claim follows because for $(a,b)=(0,1)$ we have $I_{a,b} = \\{1,\\ldots, m\\}$."
   ]
  },
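  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked version of this last step (a sketch, assuming $I_{a,b}$ denotes the set of indices $i$ with $h_\\theta(x^{(i)})\\in(a,b)$, as in the problem statement), dividing the identity above by $m = |I_{0,1}|$ gives the calibration property for $(a,b)=(0,1)$:\n",
    "\n",
    "$$ \\frac{\\sum_{i\\in I_{0,1}} P(y^{(i)}=1\\mid x^{(i)};\\theta)}{|I_{0,1}|} = \\frac 1 m \\sum_{i=1}^m h_\\theta(x^{(i)}) = \\frac 1 m \\sum_{i=1}^m \\mathbb 1\\{y^{(i)}=1\\} = \\frac{\\sum_{i\\in I_{0,1}} \\mathbb 1\\{y^{(i)}=1\\}}{|I_{0,1}|}. $$"
   ]
  },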
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (b)\n",
    "\n",
    "This is false: \n",
    "For example if we have as many positive as negative examples in our training set, then a model that always predicts $\\frac {1}2$ is perfectly calibrated.\n",
    "\n",
    "The converse is also false:\n",
    "If the model computes $P(y^{(i)}=1 \\mid x^{(i)};\\theta) = \\frac 1 3$ for every negative example and $P(y^{(i)}=1 \\mid x^{(i)};\\theta) = \\frac 2 3$ for every positive example, then it would have perfect accuracy.\n",
    "But for $(a,b)=(0,0.5)$ we would get\n",
    "$$ \\frac {\\sum_{i\\in I_{a,b}} P(y^{(i)}=1 \\mid x^{(i)};\\theta)}{|I_{a,b}|} = \\frac 1 3 \\not=0 =  \\frac {\\sum_{i\\in I_{a,b}}  \\mathbb 1\\{y^{(i)}=1\\}}{|I_{a,b}|}. $$"
   ]
  },
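  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the counterexample concrete, here is a small illustration with a hypothetical training set (not part of the problem statement): two negative examples with predicted probability $\\frac 1 3$ and two positive examples with predicted probability $\\frac 2 3$. Assuming $I_{a,b}$ collects the indices whose prediction lies in $(a,b)$, the interval $(a,b)=(0.5,1)$ contains exactly the two positive examples, and calibration fails there as well:\n",
    "\n",
    "$$ \\frac{\\frac 2 3 + \\frac 2 3}{2} = \\frac 2 3 \\neq 1 = \\frac{1+1}{2}. $$\n",
    "\n",
    "Thresholding at $\\frac 1 2$ still classifies all four examples correctly, so accuracy and calibration are independent properties in this sketch."
   ]
  },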
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (c)\n",
    "\n",
    "If we only apply regularization to $\\theta_1,\\ldots, \\theta_n$ but not to $\\theta_0$, then our reasoning from (a) still holds, i.e. the model will be calibrated at least for $(a,b)=(0,1)$.\n",
    "\n",
    "If we also regularize $\\theta_0$ this isn't necessarily true anymore because the $0$-th component of the gradient of the regularized loss function $J(\\theta)=-\\ell(\\theta) + \\frac \\lambda  2\\lVert \\theta\\rVert_2^2$ will be given by "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ 0 = -\\sum_{i=1}^m (y^{(i)}- h_\\theta(x^{(i)}) + \\lambda \\theta_0,$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where $\\lambda$ is the regularization factor."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
